Enable no-src GPU SDMA transfers #295
Open
nileshnegi wants to merge 2 commits into
Allow EXE_GPU_DMA transfers with zero sources to perform a memset using hsa_amd_memory_fill, which enqueues a LINEAR_FILL operation on the SDMA engines.

Fill value:
    uint32_t fillVal = bit_cast<uint32_t>(MEMSET_VAL); // 0x4B4B4B4B
    count = numBytes / sizeof(uint32_t);               // count is in uint32_t units

0x4B4B4B4B matches both memset(MEMSET_CHAR) (used by dstReference[0]) and MemsetVal<float>() (used by the GFX no-src kernel), so existing correctness validation passes without changes.

Validation changes (AMD only, gated on !__NVCC__):
- DMA no-src is now valid; it is rejected only on NVIDIA builds
- DMA no-src with a specific SDMA engine (e.g. "n d0.2 g1") is rejected because hsa_amd_memory_fill has no engine-selection parameter
- Copy-agent-selection warnings are guarded by !t.srcs.empty() to avoid an out-of-bounds access when no source is specified

Execution changes (ExecuteDmaTransfer):
- The no-src path is hoisted before the hipMemcpy/HSA async-copy branches
- Copy paths (hipMemcpy and HSA async copy) are unchanged

HSA resource setup:
- The srcMem pointer-info query is guarded by !rss.srcMem.empty()

Co-authored-by: Claude <claude@anthropic.com>
Pull request overview
This PR extends the EXE_GPU_DMA executor to accept “0-source” transfers on AMD platforms, enabling SDMA-based memset/fill behavior without introducing a new executor type.
Changes:
- Relax DMA transfer validation from “exactly 1 source” to “0 or 1 source”, while rejecting 0-src DMA on NVIDIA builds.
- Add a 0-src DMA execution path that uses hsa_amd_memory_fill() to fill destinations with the existing MEMSET_VAL byte pattern.
- Guard source-agent setup and DMA copy-agent-selection warnings so they don’t dereference srcs[0] when the transfer has no sources.
hsa_amd_memory_fill does not record HIP events, so querying hipEventElapsedTime after a 0-src DMA transfer produced an "invalid resource handle" error. Guard the HIP event timing path with !resources.srcMem.empty(); the fill path falls back to CPU wall-clock time, which is accurate since hsa_amd_memory_fill is synchronous.

Co-authored-by: Claude <claude@anthropic.com>
Motivation
Extends EXE_GPU_DMA to accept zero-source transfers, enabling SDMA-driven memset without introducing a new executor type.
Technical Details
Fill value:
0x4B4B4B4B matches both memset(MEMSET_CHAR) (used by dstReference[0]) and MemsetVal<float>() (used by the GFX 0-src kernel), so existing correctness validation passes without changes.
Validation:
- srcs.size() != 1 → srcs.size() > 1; 0 sources are now valid on AMD
- 0 sources with exeSubIndex != -1 results in a fatal error
- Copy-agent-selection warnings are guarded by if (!t.srcs.empty())
Resource setup:
- hsa_amd_pointer_info on srcMem[0] is guarded by !rss.srcMem.empty()
Constraints:
- hsa_amd_memory_fill has no engine parameter or mask, so combining 0 sources with an engine subindex (e.g., n d0.2 g1) is rejected at validation.
Test Plan
Test Result
Submission Checklist